Experimental Comparison of Set Intersection Algorithms for Inverted Indexing

Author

  • Vladimír Boza
Abstract

The set intersection problem is one of the main problems in document retrieval. A query consists of two keywords, and for each keyword we have a sorted set of the IDs of documents containing it. The goal is to retrieve the set of document IDs containing both keywords. We perform an experimental comparison of Galloping search and a new algorithm by Cohen and Porat (LATIN 2010), which has a better theoretical time complexity. We show that the new algorithm often performs worse than the trivial one on real data. We also propose a variant of the Cohen and Porat algorithm with a similar complexity but better empirical performance. Finally, we investigate the influence of document ordering on query time.
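As an illustration of the problem setting, the following is a minimal sketch of intersecting two sorted posting lists with galloping (exponential) search. It is not the implementation evaluated in the paper; the function names `gallop_to` and `intersect_galloping` are hypothetical, and the lists are assumed to be sorted and free of duplicates.

```python
from bisect import bisect_left


def gallop_to(lst, target, lo):
    """Return the smallest index i >= lo with lst[i] >= target, or len(lst) if none."""
    step = 1
    hi = lo
    # Double the step until we reach an element >= target or run off the end.
    while hi < len(lst) and lst[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # Finish with a binary search inside the bracket found above.
    return bisect_left(lst, target, lo, min(hi, len(lst)))


def intersect_galloping(a, b):
    """Intersect two sorted, duplicate-free lists of document IDs (assumed input shape)."""
    # Iterate over the shorter list and gallop through the longer one.
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    result = []
    pos = 0
    for doc_id in short:
        pos = gallop_to(long_, doc_id, pos)
        if pos == len(long_):
            break  # the longer list is exhausted
        if long_[pos] == doc_id:
            result.append(doc_id)
            pos += 1
    return result


if __name__ == "__main__":
    print(intersect_galloping([2, 5, 9, 40], [1, 2, 3, 5, 8, 13, 21, 34, 40, 55]))
```

Galloping from the previous match position means each lookup costs time logarithmic in the distance skipped, which tends to pay off when one list is much shorter than the other.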

Similar resources

Faster Exact Histogram Intersection on Large Data Collections Using Inverted VA-Files

Most indexing structures for high-dimensional vectors used in multimedia retrieval today rely on determining the importance of each vector component at indexing time in order to create the index. However for Histogram Intersection and other important distance measures this is not possible because the importance of vector components depends on the query. We present an indexing structure inspired...

An Effective Approach to Temporally Anchored Information Retrieval

We consider in this paper the information retrieval problem over a collection of time-evolving documents such that the search has to be carried out based on a query text and a temporal specification. A solution to this problem is critical for a number of emerging large scale applications involving archived collections of web contents, social network interactions, blog traffic, and information f...

Fast Sorted-Set Intersection using SIMD Instructions

In this paper, we focus on sorted-set intersection which is an important part in many algorithms, e.g., RID-list intersection, inverted indexes, and others. In contrast to traditional scalar sorted-set intersection algorithms that try to reduce the number of comparisons, we propose a parallel algorithm that relies on speculative execution of comparisons. In general, our algorithm requires more ...

The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K-List Similarity Search

We consider the problem of processing similarity queries over a set of top-k rankings where the query ranking and the similarity threshold are provided at query time. Spearman’s Footrule distance is used to compute the similarity between rankings, considering how well rankings agree on the positions (ranks) of ranked items (i.e., the L1 distance). This setup allows the application of metric ind...

Scheduling Intersection Queries in Term Partitioned Inverted Files

This paper proposes and presents a comparison of scheduling algorithms applied to the context of load balancing the query traffic on distributed inverted files. We put emphasis on queries requiring intersection of posting lists, which is a very demanding case for the term partitioned inverted file and a case in which the document partitioned inverted file used by current search engines can perf...

Publication date: 2013